Code
library(tidyverse)
library(ggplot2)
library(readr)
library(corrplot)
library(reshape2)
library(corrgram)
library(ggpubr)
library(caTools)
library(GGally)
Niharika Pola
December 1, 2022
What is takes for a country or a continent to be happy? Is it the economy, life-expectancy, freedom or the trust in the government? What are the factors that affect a country’s or continents overall happiness? Can we predict the happiness score? The curiosity to find answers to these questions made me explore the world happiness data of 2022.
“This year marks the 10th anniversary of the World Happiness Report, which uses global survey data to report how people evaluate their own lives in more than 150 countries worldwide. The World Happiness Report 2022 reveals a bright light in dark times. The pandemic brought not only pain and suffering but also an increase in social support and benevolence. As we battle the ills of disease and war, it is essential to remember the universal desire for happiness and the capacity of individuals to rally to each other’s support in times of great need.” - World Happiness Report 2022
The World happiness data tries to measure the happiness of the populace of every country and comes up with a score which connotes the level of happiness of the populace.
The data set uses various variables to measure happiness such as the GDP per capita, Freedom to make choices, life expectancy, the perception of corruption, generosity and social support.
In this study, I aim to find out answers to the following research questions:
I wish to test the following hypothesis,
primary <- read.csv("project datasets/World Happiness Report 2022.csv")
head(primary)
str(primary)
str(primary$Country)
The dataset that I have chosen is happiness 2022 dataset, one of Kaggle’s dataset. This dataset gives the happiness rank and happiness score of 147 countries around the world based on 8 factors including GDP per capita, Social support, Health life expectancy, freedom to make life choices, Generosity, Perceptions of corruption and dystopia residual. The higher value of each of these 8 factors means the level of happiness is higher. Dystopia is the opposite of utopia and has the lowest happiness level. Dystopia will be considered as a reference for other countries to show how far they are from being the poorest country regarding happiness level.
Source of the data: World Happiness Report 2022 use data from the Gallup World Poll surveys from 2019 to 2021. They are based on answers to the main life evaluation question asked in the poll.
Some of the variable names are not clear enough and I decided to change the name of several of them a little bit. Also, I will remove whisker low and whisker high variables from my dataset because these variables give only the lower and upper confidence interval of happiness score and there is no need to use them for visualization and prediction.
The next step is adding another column to the dataset which is continent. I want to work on different continents to discover whether there are different trends for them regarding which factors play a significant role in gaining higher happiness score. Asia, Africa, North America, South America, Europe, and Australia are our six continents in this dataset. Then I moved the position of the continent column to the second column because I think with this position arrange, dataset looks better. Finally, I changed the type of continent variable to factor to be able to work with it easily for visualization.
Error in colnames(primary) <- c("Happiness.Rank", "Country", "Happiness.Score", : object 'primary' not found
# Country: Name of countries
# Happiness.Rank: Rank of the country based on the Happiness Score
# Happiness.Score: Happiness measurement on a scale of 0 to 10
# Whisker.High: Upper confidence interval of happiness score
# Whisker.Low: Lower confidence interval of happiness score
# Economy: The value of all final goods and services produced within a nation in a given year
# Family: Importance of having a family
# Life.Expectancy: Importance of health and amount of time prople expect to live
# Freedom: Importance of freedom in each country
# Generosity: The quality of being kind and generous
# Trust: Perception of corruption in a government
# Dystopia.Residual: Plays as a reference
# Deleting unnecessary columns (Whisker.high and Whisker.low)
primary <- primary[, -c(4,5)]
Error in eval(expr, envir, enclos): object 'primary' not found
Error in primary$Continent <- NA: object 'primary' not found
primary$Continent[which(primary$Country %in% c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan Province of China",
"Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
"South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong S.A.R., China", "Philippines",
"Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
"Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
"Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
"Cambodia", "Afghanistan", "Yemen", "Syria"))] <- "Asia"
Error in primary$Continent[which(primary$Country %in% c("Israel", "United Arab Emirates", : object 'primary' not found
primary$Continent[which(primary$Country %in% c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
"Netherlands", "Sweden", "Austria", "Ireland", "Germany",
"Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
"Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
"Russia", "Lithuania", "Latvia", "Moldova", "Romania",
"Slovenia", "North Cyprus", "Cyprus", "Estonia", "Belarus",
"Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
"Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
"Bulgaria", "Albania", "Ukraine"))] <- "Europe"
Error in primary$Continent[which(primary$Country %in% c("Norway", "Denmark", : object 'primary' not found
Error in primary$Continent[which(primary$Country %in% c("Canada", "Costa Rica", : object 'primary' not found
Error in primary$Continent[which(primary$Country %in% c("Chile", "Brazil", : object 'primary' not found
Error in primary$Continent[which(primary$Country %in% c("New Zealand", : object 'primary' not found
Error in primary$Continent[which(is.na(primary$Continent))] <- "Africa": object 'primary' not found
Error in select(., Country, Continent, everything()): object 'primary' not found
Error in eval(expr, envir, enclos): object 'primary' not found
'data.frame': 51020 obs. of 10 variables:
$ id : num 1 2 3 4 5 6 7 8 9 10 ...
$ happy : Factor w/ 3 levels "not too happy",..: 1 1 2 1 2 2 1 1 2 2 ...
$ year : num 1972 1972 1972 1972 1972 ...
$ age : num 23 70 48 27 61 26 28 27 21 30 ...
$ sex : Factor w/ 2 levels "male","female": 2 1 2 2 2 1 1 1 2 2 ...
$ marital: Factor w/ 5 levels "married","never married",..: 2 1 1 1 1 2 3 2 2 1 ...
$ degree : Factor w/ 5 levels "lt high school",..: 4 1 2 4 2 2 2 4 2 2 ...
$ finrela: Factor w/ 5 levels "far below average",..: 3 4 3 3 4 4 4 3 3 2 ...
$ health : Factor w/ 4 levels "poor","fair",..: 3 2 4 3 3 3 4 3 4 2 ...
$ wtssall: num 0.445 0.889 0.889 0.889 0.889 ...
As we already know that the sum of these numeric variables gives the happiness score and there is an inverse relationship between happiness score and happiness rank. The higher the happiness score, the lower the happiness rank. Hence there is no need of looking at the correlation between each numeric variable and happiness rank. We can directly look at the happiness score and other numeric variables.
Error in ggcorr(dataset, label = TRUE, label_round = 2, label_size = 3.5, : object 'dataset' not found
According to the above cor plot, Economy, life expectancy, and family play the most significant role in contributing to happiness. Trust and generosity have the lowest impact on the happiness score.
Let’s calculate the average happiness score and the average of the other seven variables for each continent. Then melt it to have variables and values in separate columns. Finally, using ggplot to show the difference between continents.
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `Continent` is not found.
Error in melt(happy.Continent): object 'happy.Continent' not found
# Faceting
ggplot(happy.Continent.melt, aes(y=value, x=Continent, color=Continent, fill=Continent)) +
geom_bar( stat="identity") +
facet_wrap(~variable) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Average value of happiness variables for different continents",
y = "Average value")
Error in ggplot(happy.Continent.melt, aes(y = value, x = Continent, color = Continent, : object 'happy.Continent.melt' not found
We can see that Australia has approximately the highest average in all fields except dystopia residual, after that Europe, North America, and South America are roughly the same regarding happiness score and the other seven factors. Finally, Asia and Africa have the lowest scores in all fields.
Let’s see the correlation between variables for each continent.
Error in `filter()`:
! Problem while computing `..1 = Continent == "Africa"`.
Caused by error in `mask$eval_all_filter()`:
! object 'Continent' not found
Correlation between “Happiness Score” and the other variables in Africa:
Economy > Life Expectancy > Family > Dystopia Residual > Freedom
Happiness score and Generosity are inversely correlated.
There is no correlation between happiness score and trust.
Error in `filter()`:
! Problem while computing `..1 = Continent == "Asia"`.
Caused by error in `mask$eval_all_filter()`:
! object 'Continent' not found
Correlation between “Happiness Score” and the other variables in Asia:
Family > Dystopia.Residual > Freedom > Economy > Life Expectancy > Trust
There is no correlation between happiness score and Generosity.
Error in `filter()`:
! Problem while computing `..1 = Continent == "Europe"`.
Caused by error in `mask$eval_all_filter()`:
! object 'Continent' not found
Correlation between “Happiness Score” and the other variables in Europe:
Trust > Economy > Freedom > Life Expectancy > Dystopia Residual > Family > Generosity
Error in `filter()`:
! Problem while computing `..1 = Continent == "North America"`.
Caused by error in `mask$eval_all_filter()`:
! object 'Continent' not found
Correlation between “Happiness Score” and the other variables in North America:
Economy > Life Expectancy > Family > Generosity > Trust
There is an inverse correlation between happiness score and dystopia residual.
Error in `filter()`:
! Problem while computing `..1 = Continent == "South America"`.
Caused by error in `mask$eval_all_filter()`:
! object 'Continent' not found
Correlation between “Happiness Score” and the other variables in South America:
Dystopia.Residual > Trust > Family > Freedom > Life Expectancy
There is an inverse correlation between happiness score and dystopia residual, Generosity.
We will use scatter plot, box plot, and violin plot to see the happiness score distribution in different countries, how this score is populated in these continents and also will calculate the mean and median of happiness score for each of these continents.
####### Happiness score for each continent
gg1 <- ggplot(happy,
aes(x=Continent,
y=Happiness.Score,
color=Continent))+
geom_point() + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
gg2 <- ggplot(happy , aes(x = Continent, y = Happiness.Score)) +
geom_boxplot(aes(fill=Continent)) + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
gg3 <- ggplot(happy,aes(x=Continent,y=Happiness.Score))+
geom_violin(aes(fill=Continent),alpha=0.7)+ theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
# Compute descriptive statistics by groups
stable <- desc_statby(happy, measure.var = "Happiness.Score",
grps = "Continent")
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `Continent` is not found.
Error in eval(expr, envir, enclos): object 'stable' not found
Error in names(stable) <- c("Continent", "Mean of happiness score", "Median of happiness score"): object 'stable' not found
Error in as.matrix(d): object 'stable' not found
Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `FUN()`:
! object 'Continent' not found
Error in ggarrange(gg3, stable.p, ncol = 1, nrow = 2): object 'stable.p' not found
As we have seen before, Australia has the highest median happiness score. Europe, South America, and North America are in the second place regarding median happiness score. Asia has the lowest median after Africa. We can see the range of happiness score for different continents, and also the concentration of happiness score.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Life.Expectancy, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Life Expectancy)")
Error in happy$Continent: $ operator is invalid for atomic vectors
The correlation between life expectancy and happiness score in Europe, North America, and Asia is more significant than the other continents. Worth mentioning that we will not take Australia into account because there are just two countries in Australia and creating scatter plot with the regression line for this continent will not give us any insight.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Economy, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Economy)")
Error in happy$Continent: $ operator is invalid for atomic vectors
We can see pretty the same result here for the correlation between happiness score and economy. Africa has the lowest relationship in this regard.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Freedom, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Freedom)")
Error in happy$Continent: $ operator is invalid for atomic vectors
Freedom in Europe and North America is more correlated to happiness score than any other continents.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Trust, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Trust)")
Error in happy$Continent: $ operator is invalid for atomic vectors
Approximately there is no correlation between trust and happiness score in Africa.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Generosity, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Generosity)")
Error in happy$Continent: $ operator is invalid for atomic vectors
The regression line has a positive slope only for Europe and South America. For Asia the line is horizontal, and for Africa and North America the slope is negative.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Family, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Family)")
Error in happy$Continent: $ operator is invalid for atomic vectors
In South America with increase in the family score, the happiness score remains constant.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Dystopia.Residual, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Dystopia Residual)")
Error in happy$Continent: $ operator is invalid for atomic vectors
All continents act pretty the same regarding dystopia residual.
In this section, we will implement several machine learning algorithms to predict happiness score. First, we should split our dataset into training and test set. Our dependent variable is happiness score, and the independent variables are family, economy, life expectancy, trust, freedom, generosity, and dystopia residual.
Error in `[.data.frame`(happy, 4:11): undefined columns selected
Error in sample.split(dataset$Happiness.Score, SplitRatio = 0.8): object 'dataset' not found
Error in subset(dataset, split == TRUE): object 'dataset' not found
Error in subset(dataset, split == FALSE): object 'dataset' not found
Error in is.data.frame(data): object 'training_set' not found
Error in summary(regressor_lm): object 'regressor_lm' not found
The summary shows that all independent variables have a significant impact, and adjusted R squared is 1! As we discussed, it is clear that there is a linear correlation between dependent and independent variables. Again, I should mention that the sum of the independent variables is equal to the dependent variable which is the happiness score. This is the justification for having an adjusted R squared equals to 1. As a result, I guess Multiple Linear Regression will predict happiness scores with 100 % accuracy!
Error in predict(regressor_lm, newdata = test_set): object 'regressor_lm' not found
Error in cbind(Prediction = y_pred_lm, Actual = test_set$Happiness.Score): object 'y_pred_lm' not found
gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Multiple Linear Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
Error in ggplot(Pred_Actual_lm, aes(Actual, Prediction)): object 'Pred_Actual_lm' not found
Error in eval(expr, envir, enclos): object 'gg.lm' not found
As we expected, actual versus predicted plot shows the accuracy of our model.
Error in is.data.frame(data): object 'training_set' not found
Error in summary(regressor_lm_2): object 'regressor_lm_2' not found
Error in predict(regressor_lm_2, newdata = test_set): object 'regressor_lm_2' not found
Error in cbind(Prediction = y_pred_lm, Actual = test_set$Happiness.Score): object 'y_pred_lm' not found
gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Multiple Linear Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
Error in ggplot(Pred_Actual_lm, aes(Actual, Prediction)): object 'Pred_Actual_lm' not found
Error in eval(expr, envir, enclos): object 'gg.lm' not found
---
title: "Final Project Part-2"
author: "Niharika Pola"
description: "Final Project Part-2"
date: "12/1/2022"
format:
html:
df-print: paged
css: styles.css
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- Final Project Part-2
---
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(readr)
library(corrplot)
library(reshape2)
library(corrgram)
library(ggpubr)
library(caTools)
library(GGally)
```
## Background/Motivation
What is takes for a country or a continent to be happy? Is it the economy, life-expectancy, freedom or the trust in the government? What are the factors that affect a country's or continents overall happiness? Can we predict the happiness score? The curiosity to find answers to these questions made me explore the world happiness data of 2022.
"This year marks the 10th anniversary of the **World Happiness Report**, which uses global survey data to report how people evaluate their own lives in more than 150 countries worldwide. The **World Happiness Report 2022** reveals a bright light in dark times. The pandemic brought not only pain and suffering but also an increase in social support and benevolence. As we battle the ills of disease and war, it is essential to remember the universal desire for happiness and the capacity of individuals to rally to each other's support in times of great need." - World Happiness Report 2022
## Research Question
The World happiness data tries to measure the happiness of the populace of every country and comes up with a score which connotes the level of happiness of the populace.
The data set uses various variables to measure happiness such as the GDP per capita, Freedom to make choices, life expectancy, the perception of corruption, generosity and social support.
In this study, I aim to find out answers to the following research questions:
1. What are the variables or factors that are affecting world's happiness, with a focus on individual countries & continents. This includes analyzing the correlation between most effective variables.
2. To find out which model accurately predicts the happiness score.
## Hypothesis
I wish to test the following hypothesis,
1. Better economy of a country would lead to happiness
2. Longer life expectancy would lead to happiness
3. Having family/social support leads to happiness
4. Freedom leads to happiness
5. People's trust in the Government leads to happiness
6. Generosity leads to happiness
## Data Preparation
### Reading the data set
```{r message=FALSE, warning=FALSE, error=FALSE}
primary <- read_csv("project datasets/World Happiness Report 2022.csv")
head(primary)
str(primary)
str(primary$Country)
```
The dataset that I have chosen is happiness 2022 dataset, one of Kaggle's dataset. This dataset gives the happiness rank and happiness score of 147 countries around the world based on 8 factors including GDP per capita, Social support, Health life expectancy, freedom to make life choices, Generosity, Perceptions of corruption and dystopia residual. The higher value of each of these 8 factors means the level of happiness is higher. Dystopia is the opposite of utopia and has the lowest happiness level. Dystopia will be considered as a reference for other countries to show how far they are from being the poorest country regarding happiness level.
Source of the data: World Happiness Report 2022 use data from the Gallup World Poll surveys from 2019 to 2021. They are based on answers to the main life evaluation question asked in the poll.
Some of the variable names are not clear enough and I decided to change the name of several of them a little bit. Also, I will remove whisker low and whisker high variables from my dataset because these variables give only the lower and upper confidence interval of happiness score and there is no need to use them for visualization and prediction.
The next step is adding another column to the dataset which is continent. I want to work on different continents to discover whether there are different trends for them regarding which factors play a significant role in gaining higher happiness score. Asia, Africa, North America, South America, Europe, and Australia are our six continents in this dataset. Then I moved the position of the continent column to the second column because I think with this position arrange, dataset looks better. Finally, I changed the type of continent variable to factor to be able to work with it easily for visualization.
## Preparation of the data
```{r message=FALSE, warning=FALSE, error=FALSE}
# Changing the name of columns
colnames (primary) <- c("Happiness.Rank", "Country", "Happiness.Score",
"Whisker.High", "Whisker.Low", "Dystopia.Residual", "Economy", "Family", "Life.Expectancy", "Freedom", "Generosity",
"Trust")
# Country: Name of countries
# Happiness.Rank: Rank of the country based on the Happiness Score
# Happiness.Score: Happiness measurement on a scale of 0 to 10
# Whisker.High: Upper confidence interval of happiness score
# Whisker.Low: Lower confidence interval of happiness score
# Economy: The value of all final goods and services produced within a nation in a given year
# Family: Importance of having a family
# Life.Expectancy: Importance of health and amount of time prople expect to live
# Freedom: Importance of freedom in each country
# Generosity: The quality of being kind and generous
# Trust: Perception of corruption in a government
# Dystopia.Residual: Plays as a reference
# Deleting unnecessary columns (Whisker.high and Whisker.low)
primary <- primary[, -c(4,5)]
```
```{r message=FALSE, warning=FALSE}
primary$Continent <- NA
primary$Continent[which(primary$Country %in% c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan Province of China",
"Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
"South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong S.A.R., China", "Philippines",
"Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
"Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
"Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
"Cambodia", "Afghanistan", "Yemen", "Syria"))] <- "Asia"
primary$Continent[which(primary$Country %in% c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
"Netherlands", "Sweden", "Austria", "Ireland", "Germany",
"Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
"Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
"Russia", "Lithuania", "Latvia", "Moldova", "Romania",
"Slovenia", "North Cyprus", "Cyprus", "Estonia", "Belarus",
"Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
"Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
"Bulgaria", "Albania", "Ukraine"))] <- "Europe"
primary$Continent[which(primary$Country %in% c("Canada", "Costa Rica", "United States", "Mexico",
"Panama","Trinidad and Tobago", "El Salvador", "Belize", "Guatemala",
"Jamaica", "Nicaragua", "Dominican Republic", "Honduras",
"Haiti"))] <- "North America"
primary$Continent[which(primary$Country %in% c("Chile", "Brazil", "Argentina", "Uruguay",
"Colombia", "Ecuador", "Bolivia", "Peru",
"Paraguay", "Venezuela"))] <- "South America"
primary$Continent[which(primary$Country %in% c("New Zealand", "Australia"))] <- "Australia"
primary$Continent[which(is.na(primary$Continent))] <- "Africa"
# Moving the continent column's position in the dataset to the second column
primary <- primary %>% select(Country,Continent, everything())
#Renaming the final dataframe to happy
happy <- primary
str(happy)
```
## Visualization
## Analyzing the correlation between each numeric variable
As we already know that the sum of these numeric variables gives the happiness score and there is an inverse relationship between happiness score and happiness rank. The higher the happiness score, the lower the happiness rank. Hence there is no need of looking at the correlation between each numeric variable and happiness rank. We can directly look at the happiness score and other numeric variables.
```{r message=FALSE, warning=FALSE, error=FALSE}
# Create a correlation plot
ggcorr(dataset, label = TRUE, label_round = 2, label_size = 3.5, size = 2, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
```
According to the above cor plot, Economy, life expectancy, and family play the most significant role in contributing to happiness. Trust and generosity have the lowest impact on the happiness score.
### **Comparing different continents regarding their happiness variables**
Let's calculate the average happiness score and the average of the other seven variables for each continent. Then melt it to have variables and values in separate columns. Finally, using ggplot to show the difference between continents.
```{r message=FALSE, warning=FALSE, error=FALSE}
happy.Continent <- happy %>%
select(-3) %>%
group_by(Continent) %>%
summarise_at(vars(-Country), funs(mean(., na.rm=TRUE)))
# Or we can use aggregate
# aggregate(happy[, 4:11], list(happy$Continent), mean)
# Melting the "happy.Continent" dataset
happy.Continent.melt <- melt(happy.Continent)
# Faceting
ggplot(happy.Continent.melt, aes(y=value, x=Continent, color=Continent, fill=Continent)) +
geom_bar( stat="identity") +
facet_wrap(~variable) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Average value of happiness variables for different continents",
y = "Average value")
```
We can see that Australia has approximately the highest average in all fields except dystopia residual, after that Europe, North America, and South America are roughly the same regarding happiness score and the other seven factors. Finally, Asia and Africa have the lowest scores in all fields.
### **Correlation plot for each continent**
Let's see the correlation between variables for each continent.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggcorr(happy %>% select(-3) %>% filter(Continent == "Africa"), label = TRUE, label_round = 2, label_size = 3.5, size = 3.5, hjust = .85) +
ggtitle("Happiness Matrix for Africa") +
theme(plot.title = element_text(hjust = 0.5))
```
**Correlation between "Happiness Score" and the other variables in Africa:**\
Economy \> Life Expectancy \> Family \> Dystopia Residual \> Freedom
Happiness score and Generosity are inversely correlated.\
There is no correlation between happiness score and trust.\
```{r message=FALSE, warning=FALSE, error=FALSE}
ggcorr(happy %>% select(-3) %>% filter(Continent == "Asia"), label = TRUE, label_round = 2, label_size = 3.5, size = 3.5, hjust = .85) +
ggtitle("Happiness Matrix for Asia") +
theme(plot.title = element_text(hjust = 0.5))
```
**Correlation between "Happiness Score" and the other variables in Asia:**\
Family \> Dystopia.Residual \> Freedom \> Economy \> Life Expectancy \> Trust\
There is no correlation between happiness score and Generosity.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggcorr(happy %>% select(-3) %>% filter(Continent == "Europe"), label = TRUE, label_round = 2, label_size = 3.5, size = 3.5, hjust = .85) +
ggtitle("Happiness Matrix for Europe") +
theme(plot.title = element_text(hjust = 0.5))
```
**Correlation between "Happiness Score" and the other variables in Europe:**\
Trust \> Economy \> Freedom \> Life Expectancy \> Dystopia Residual \> Family \> Generosity\
```{r message=FALSE, warning=FALSE, error=FALSE}
ggcorr(happy %>% select(-3) %>% filter(Continent == "North America"), label = TRUE, label_round = 2, label_size = 3.5, size = 3.5, hjust = .85) +
ggtitle("Happiness Matrix for North America") +
theme(plot.title = element_text(hjust = 0.5))
```
**Correlation between "Happiness Score" and the other variables in North America:**\
Economy \> Life Expectancy \> Family \> Generosity \> Trust \
There is an inverse correlation between happiness score and dystopia residual.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggcorr(happy %>% select(-3) %>% filter(Continent == "South America"), label = TRUE, label_round = 2, label_size = 3.5, size = 3.5, hjust = .85) +
ggtitle("Happiness Matrix for South America") +
theme(plot.title = element_text(hjust = 0.5))
```
**Correlation between "Happiness Score" and the other variables in South America:**\
Dystopia.Residual \> Trust \> Family \> Freedom \> Life Expectancy\
There is an inverse correlation between happiness score and dystopia residual, Generosity.
### **Happiness score comparison on different continents**
We will use scatter plot, box plot, and violin plot to see the happiness score distribution in different countries, how this score is populated in these continents and also will calculate the mean and median of happiness score for each of these continents.
```{r message=FALSE, warning=FALSE, error=FALSE}
####### Happiness score for each continent
gg1 <- ggplot(happy,
aes(x=Continent,
y=Happiness.Score,
color=Continent))+
geom_point() + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
gg2 <- ggplot(happy , aes(x = Continent, y = Happiness.Score)) +
geom_boxplot(aes(fill=Continent)) + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
gg3 <- ggplot(happy,aes(x=Continent,y=Happiness.Score))+
geom_violin(aes(fill=Continent),alpha=0.7)+ theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
# Compute descriptive statistics by groups
stable <- desc_statby(happy, measure.var = "Happiness.Score",
grps = "Continent")
stable <- stable[, c("Continent","mean","median")]
names(stable) <- c("Continent", "Mean of happiness score","Median of happiness score")
# Summary table plot
stable.p <- ggtexttable(stable,rows = NULL,
theme = ttheme("classic"))
ggarrange(gg1, gg2, ncol = 1, nrow = 2)
```
```{r message=FALSE, warning=FALSE, error=FALSE}
ggarrange(gg3, stable.p, ncol = 1, nrow = 2)
```
As we have seen before, Australia has the highest median happiness score. Europe, South America, and North America are in the second place regarding median happiness score. Asia has the lowest median after Africa. We can see the range of happiness score for different continents, and also the concentration of happiness score.
## Find Relationship using Scatter Plot of Happiness Score with each variable (include regression line)
```{r message=FALSE, warning=FALSE, error=FALSE}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Life.Expectancy, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Life Expectancy)")
```
The correlation between life expectancy and happiness score in Europe, North America, and Asia is more significant than the other continents. Worth mentioning that we will not take Australia into account because there are just two countries in Australia and creating scatter plot with the regression line for this continent will not give us any insight.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Economy, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Economy)")
```
We can see pretty the same result here for the correlation between happiness score and economy. Africa has the lowest relationship in this regard.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Freedom, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Freedom)")
```
Freedom in Europe and North America is more correlated to happiness score than any other continents.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Trust, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Trust)")
```
Approximately there is no correlation between trust and happiness score in Africa.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Generosity, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Generosity)")
```
The regression line has a positive slope only for Europe and South America. For Asia the line is horizontal, and for Africa and North America the slope is negative.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Family, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Family)")
```
In South America with increase in the family score, the happiness score remains constant.
```{r message=FALSE, warning=FALSE, error=FALSE}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Dystopia.Residual, y = Happiness.Score)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line (Happiness Score & Dystopia Residual)")
```
All continents act pretty the same regarding dystopia residual.
## Prediction using Regression Models
In this section, we will implement several machine learning algorithms to predict happiness score. First, we should split our dataset into training and test set. Our dependent variable is happiness score, and the independent variables are family, economy, life expectancy, trust, freedom, generosity, and dystopia residual.
```{r message=FALSE, warning=FALSE, error=FALSE}
# Splitting the dataset into the Training set and Test set
set.seed(123)
dataset <- happy[4:11]
split = sample.split(dataset$Happiness.Score, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
```
```{r message=FALSE, warning=FALSE, error=FALSE}
# Fitting Multiple Linear Regression to the Training set
regressor_lm = lm(formula = Happiness.Score ~ .,
data = training_set)
summary(regressor_lm)
```
The summary shows that all independent variables have a significant impact, and adjusted R squared is 1! As we discussed, it is clear that there is a linear correlation between dependent and independent variables. Again, I should mention that the sum of the independent variables is equal to the dependent variable which is the happiness score. This is the justification for having an adjusted R squared equals to 1. As a result, I guess Multiple Linear Regression will predict happiness scores with 100 % accuracy!
```{r message=FALSE, warning=FALSE, error=FALSE}
####### Predicting the Test set results
y_pred_lm = predict(regressor_lm, newdata = test_set)
Pred_Actual_lm <- as.data.frame(cbind(Prediction = y_pred_lm, Actual = test_set$Happiness.Score))
gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Multiple Linear Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm
```
As we expected, actual versus predicted plot shows the accuracy of our model.
```{r message=FALSE, warning=FALSE, error=FALSE}
# Fitting Multiple Linear Regression to the Training set
regressor_lm_2 = lm(formula = happy$Happiness.Score ~ happy$Trust+happy$Life.Expectancy+ happy$Generosity + happy$Economy+ happy$Family + happy$Freedom+happy$Dystopia.Residual,
data = training_set)
summary(regressor_lm_2)
```
```{r message=FALSE, warning=FALSE, error=FALSE}
####### Predicting the Test set results
y_pred_lm = predict(regressor_lm_2, newdata = test_set)
Pred_Actual_lm <- as.data.frame(cbind(Prediction = y_pred_lm, Actual = test_set$Happiness.Score))
gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Multiple Linear Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm
```
\